-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pipeline Modifications for variable Audio and Sampler Generalisation to different Download Tasks #110
Conversation
necessary_keys = sampler_config["necessary_keys"] | ||
|
||
def requires(self): | ||
return _get_download_and_extract_tasks(self.task_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrmmm really? I mean this is the sampler so it's not core code, it's just for testing, but this seems wrong and smells bad.
requires() in luigi is supposed to return a Luigi task. Not do work. run() is where work should occur. Otherwise you might have weird luigi bugs that are hard to debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_get_download_and_extract_tasks
returns the tasks which will download and extract. So technically, the requires
is still returning a list of tasks. This is how our main pipeline is working where we use this function to build the task and then pass it in the ExtractMetadata
as luigiParameter
and put it in the requires. As discussed in the previous comment, the main reason to do this here, is that the _get_download_and_extract_tasks is different for different tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hearpreprocess/sampler.py
Outdated
"get_download_and_extract_tasks" | ||
] | ||
|
||
class RandomSampleOriginalDataset(_RandomSampleOriginalDataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have RandomSampleOriginalDataset and _RandomSampleOriginalDataset? Why can't we just have only RandomSampleOriginalDataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason to do this here like this is - The RandomSampleOriginalDataset
requires the get_download_and_extract_task, which is a function and is specific to the task. This returns the task to download and extract the task. So the _RandomSampleOriginalDataset
is overridden and named as RandomSampleOriginalDataset
and the tasks are added. I cannot refer to both the global variable _RandomSampleOriginalDataset
and local variable RandomSampleOriginalDataset
inside the get_sampler_task
function with the same name. So I had to make different names for them.
@@ -160,6 +160,9 @@ def get_audio_dir_stats( | |||
all_file_paths, | |||
) | |||
) | |||
if len(audio_paths) == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait why would this happen? Why is this okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is no audio in the downloaded directory, we should return nothing. This can happen, for example, if a downloaded directory has just metadata files, we still need to put that in the requires of the ExtractMetadata
, so that the task can actually run. This function, the one here, runs on all the downloaded directories in the requires. So, this is not a problem as we still have another assert below, So that if audio files are found and the stats are not calculated, the assert will throw, if audio files are not found, which is in this case, it will return the empty dict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hearpreprocess/util/task_config.py
Outdated
# should also be None and no subsampling will be done | ||
if task_config["sample_duration"] is None: | ||
schema["max_task_duration_by_split"] = Schema( | ||
{split: Or(int, float, None) for split in SPLITS} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None, not int, float, None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
No description provided.